ABSTRACT 

The PDBTM database (available at http://pdbtm 
.enzim.hu), the first comprehensive and up-to-date 
transmembrane protein selection of the Protein 
Data Bank, was launched in 2004. The database 
was created and has been continuously updated 
by the TMDET algorithm that is able to distinguish 
between transmembrane and non-transmembrane 
proteins using their 3D atomic coordinates only. 
The TMDET algorithm can also locate the spatial positions 
of transmembrane proteins in the lipid bilayer. During the 
last 8 years, not only has the size of the PDBTM database 
grown steadily from ~400 to 1700 entries, but new structural 
elements have also been identified, in addition to the 
well-known α-helical bundle and β-barrel structures. 
Numerous 'exotic' transmembrane protein struc- 
tures have been solved since the first release, 
which has made it necessary to define these new 
structural elements, such as membrane loops or 
interfacial helices in the database. This article 
reports the new features of the PDBTM database 
that have been added since its first release, and 
our current efforts to keep the database 
up-to-date and easy to use so that it may continue 
to serve as a fundamental resource for the scientific 
community. 

INTRODUCTION 

Transmembrane proteins play important roles in living cells 
in energy production, regulation and metabolism. The fact 
that half of present-day drugs have some effect on 
transmembrane proteins (1,2) also underlines their biological 
importance. Furthermore, ~25% of the human genome might code 
for transmembrane proteins (3), which means about 5000-6000 
structures. Owing to the structural and physicochemical 
properties of these proteins, the experimental techniques for 
structure determination are not straightforward. As a 
consequence, the ratio of transmembrane to globular proteins 
in the Protein Data Bank (PDB) (4) is <2% according to the 
PDBTM database (5,6). Hence, the PDBTM database was created 
in 2004 to collect these structures. The PDBTM database was 
the first to address a basic problem of transmembrane protein 
structures in the PDB, namely that these proteins cannot be 
identified using the annotation in the PDB entries. 
Therefore, a new method was needed that identifies 
transmembrane segments from the 3D coordinates alone and does 
not require additional information. Moreover, since one of 
the most important parts of the environment, the lipid 
bilayer, is not part of the solved atomic structures owing to 
the experimental difficulties of structure determination, 
theoretical methods are required to determine the orientation 
of transmembrane proteins relative to the lipid bilayer. 
We developed a method, called TMDET (7), which addresses 
these problems. Since then several transmembrane protein 
databases have become available on the Internet, using 
different theoretical algorithms and techniques, and serving 
different purposes. For the sake of comparison, we briefly 
summarize the main properties of these databases. 

The OPM (8) contains a well-structured classification of 
membrane proteins. The orientation of the protein relative 
to the membrane normal is defined by minimizing its 
transfer energy (ΔGtransfer) from water to the lipid 
bilayer with respect to the shift along the bilayer 
normal, hydrophobic thickness, rotation angle and tilt 
angle (9). Some missing side-chain atoms are added and 
the structure of residues at the water-lipid interface is 
adjusted. The results of these calculations are used to 
transform the atomic coordinates of integral membrane 
proteins so that the membrane normal is parallel to the 
z-axis. In the OPM database, the transformed coordinate 
files also contain the membrane planes, which are 
represented by dummy oxygen and nitrogen atoms. Topology 
data are also given in the OPM database, i.e. which parts 
of a protein face the cytosolic space and which face the 
extra-cytosolic one. 
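
The coordinate transformation described above, which makes the membrane normal parallel to the z-axis, can be illustrated with a short sketch. This is not the OPM or TMDET implementation: it assumes that a membrane normal and a membrane centre have already been determined by some other means, and the function names `rotation_to_z` and `align_to_z` are hypothetical. It only shows how a rotation built with Rodrigues' formula brings a given normal onto the z-axis.

```python
import numpy as np


def rotation_to_z(normal):
    """Return a 3x3 rotation matrix that maps `normal` onto the +z axis."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(n, z)           # rotation axis, length = sin(angle)
    s = np.linalg.norm(v)
    c = float(np.dot(n, z))      # cos(angle)
    if s < 1e-12:
        # Normal already parallel to z (identity) or anti-parallel
        # (rotate by 180 degrees about the x-axis).
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    # Rodrigues' rotation formula written in terms of the cross-product matrix.
    return np.eye(3) + vx + (vx @ vx) * ((1.0 - c) / s**2)


def align_to_z(coords, normal, centre):
    """Shift `centre` (assumed membrane centre) to the origin, then rotate
    so that `normal` (assumed membrane normal) points along +z."""
    coords = np.asarray(coords, dtype=float)
    R = rotation_to_z(normal)
    return (coords - np.asarray(centre, dtype=float)) @ R.T
```

After such a transformation, the z-coordinate of an atom directly gives its signed distance from the membrane centre plane, which is convenient for defining membrane boundaries.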

The CGDB (10) database contains the final system 
coordinates of transmembrane protein structures relaxed in 
a bilayer by coarse-grained simulation, together with an 
analysis of their protein-lipid interactions. This database 
uses the most sophisticated physical model, as it 
relies on a previously developed high-throughput 
computational approach to perform the coarse-grained 
simulations. There are two other analogous databases, which 
are more specific: the KDB is for K-channels 
(http://sbcb.bioch.ox.ac.uk/kdb/) and the OMPDB is a set of 
outer membrane proteins obtained by full-atom simulations 
(11). These databases contain indispensable information 
on dynamic aspects and stability. 

One of the most reliable databases of membrane proteins is 
the membrane proteins of known structure (Mpstruct, 
http://blanco.biomol.uci.edu/membrane_proteins_xtal.html), 
which is regularly updated. In it, membrane proteins are 
classified using a simpler classification scheme than the one 
used by the OPM. Whereas the OPM and the PDBTM 
contain information about the membrane orientation of 
proteins and about the classification of sequence segments, 
the Mpstruct does not. 

There are several other databases collecting transmembrane 
proteins and some of their properties (12-16): (i) The 
MPDB (12) is a relational database of structural and 
functional information on integral, anchored and peripheral 
membrane proteins and peptides derived from the literature 
and from the PDB. It provides various search parameters 
(protein characteristics, structure determination methods, 
crystallization techniques, detergents, temperature, pH, 
authors, etc.), and records are linked to the PDB, Pfam (13) 
or PubMed. It is updated weekly, following the weekly PDB 
updates. In addition, the MPDB provides statistics about the 
sources and the detergents used in crystallization, as well 
as about the expression systems applied, among other data. 
(ii) The TMFunction (14) is a collection of >2900 
experimentally observed functional residues in membrane 
proteins. Each entry in the TMFunction database 
includes numerical values for the parameters IC50, 
V(max), relative activity of mutants with respect to the 
wild-type protein, binding affinity and dissociation constant. 
(iii) The Transporter Classification Database (15) is a 
web-accessible, curated, relational database containing 
sequence, classification, structural, functional and 
evolutionary information about transport systems from a 
variety of living organisms. 

In the PDBTM database, we collect all transmembrane 
proteins for which structures have been solved so far; we 
check and, if necessary, correct the biologically active 
oligomer form given in the PDB files, define the membrane 
orientation and assign the transmembrane segments, 
membrane re-entrant loops and interfacial helices (IFHs). 



NEW FEATURES OF THE PDBTM DATABASE 

Although the main architecture of the TMDET algorithm 
has not been changed, several extensions have been added 
to the basic algorithm to enhance the usability and 
reliability of our database. The need for the new features 
is a consequence of the development this scientific field has 
experienced. We have enhanced the database to include 
structural elements that were not known or were rarely 
represented when the database was created. 
These are IFHs and re-entrant regions (loop, hairpin 
and re-entrant coil) (17). These and some other new 
features are discussed in the following sections. 

Correcting biomatrices 

The biological form of a protein usually does not 
correspond to the molecule present in the asymmetric unit. 
Therefore, the symmetry operations that need to be applied 
to generate the active oligomer form are given in the 
BIOMOLECULE section of the PDB file as a matrix 
transformation, called the biomatrix. The oligomer form is 
usually defined by the authors or is derived from 
theoretical calculations using PQS (18) or PISA (19). Both 
of these algorithms were developed to determine the 
quaternary structure of globular proteins; therefore, they 
may fail when applied to transmembrane proteins. We have 
found several files in which the crystal contains the 
biologically active oligomer form but the BIOMOLECULE 
records are set improperly (e.g. 2atk, 2jk5, 2zld), and 
others in which the crystal contains oligomer forms that do 
not exist in the membrane. These latter cases cannot be 
recognized by the above-mentioned methods. Most frequently 
they are subunits with anti-parallel orientation in a 
homo-dimer transmembrane protein, which were discussed in 
our original article (5). The use of inappropriate 
biomatrices occasionally leads to an inaccurate definition 
of the orientation of membrane proteins relative to the 
membrane. In some cases, the difference between the monomer 
and oligomer forms can be ~20° or larger. 

We aimed to identify and correct problems that can 
be associated with biomatrices and lead to incorrect 
oligomers. Therefore, we developed a new algorithm, 
which uses homologous protein structures to generate 
biomatrices for proteins with an inappropriate biomatrix in 
the PDB. The outline of the protocol is as follows. Protein 
structures having only one chain and no biomatrix 
annotation (or only the identity matrix in the 
biomatrix records) are placed in one pool, whereas 
those that have only one chain and a biomatrix are 
placed in another pool. Then a BLAST search is 
performed against the sequences of the second pool for 
each sequence of the first one. The protein with the 
best hit is used as a candidate, and if the sequence 
similarity is >90%, the query structure is 
superimposed on the candidate using the TM-align (20) 
algorithm. TM-align gives the transformation $T$ that maps 
$P_{\mathrm{query}}$ onto $P_{\mathrm{target}}$, formally: 

$$T P_{\mathrm{query}} = P_{\mathrm{target}}. \tag{1}$$

Assuming that $P_{\mathrm{query}}$ and $P_{\mathrm{target}}$ are identical 
monomer structures with different absolute coordinates, 
and that the corresponding biomatrices are $B_{\mathrm{query}}$ and 
$B_{\mathrm{target}}$, we get: 

$$T B_{\mathrm{query}} P_{\mathrm{query}} = B_{\mathrm{target}} P_{\mathrm{target}}. \tag{2}$$

Figure 1. Loops (coloured in orange) in 1h6i, a refined structure of 
human aquaporin (22). 



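Replacing $P_{\mathrm{target}}$ with $T P_{\mathrm{query}}$ on the basis of Equation 
(1), in Equation (2), we obtain: 

$$T B_{\mathrm{query}} P_{\mathrm{query}} = B_{\mathrm{target}} T P_{\mathrm{query}}. \tag{3}$$

Hence 

$$T B_{\mathrm{query}} = B_{\mathrm{target}} T, \tag{4}$$

and, multiplying from the left by $T^{-1}$, 

$$B_{\mathrm{query}} = T^{-1} B_{\mathrm{target}} T. \tag{5}$$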
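
As a numerical illustration of Equations (1)-(5), the following minimal sketch transfers a biomatrix from a target structure to a query structure. It is not the code used for PDBTM; it assumes that every transformation is written as a 4x4 homogeneous matrix (rotation plus translation) acting on column vectors, and the helper names `rot_z`, `homogeneous` and `transfer_biomatrix` are hypothetical.

```python
import numpy as np


def rot_z(angle_deg):
    """3x3 rotation about the z-axis (used here only to build test matrices)."""
    a = np.radians(angle_deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])


def homogeneous(rotation, translation):
    """Pack a 3x3 rotation and a translation vector into a 4x4 matrix."""
    M = np.eye(4)
    M[:3, :3] = rotation
    M[:3, 3] = translation
    return M


def transfer_biomatrix(T, B_target):
    """Equation (5): B_query = T^-1 B_target T.

    T is the superposition that maps the query monomer onto the target
    monomer; B_target is the biomatrix of the target structure.
    """
    return np.linalg.inv(T) @ B_target @ T
```

With a query coordinate array `P_query` of shape (4, N) in homogeneous form, `T @ transfer_biomatrix(T, B_target) @ P_query` should agree with `B_target @ (T @ P_query)` up to rounding, which is exactly Equation (2).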

We have checked the accuracy of this procedure by 
applying it to entries that are homo-oligomers and 
have a correct BIOMOLECULE record. 
The PDBTM database contains 318 such entries. After 
sequence filtering to 90% identity, 57 entries remained. We 
could generate biomatrices for 43 of them using 
homologous protein structures. After calculating the 
coordinates using these newly generated biomatrices, we 
calculated the root mean square deviation (RMSD) between the 
original and computed coordinates. The RMSD values of 40 out 
of the 43 entries were <1 Å (avg: 0.38 ± 0.20 Å), while the 
worst alignment produced an RMSD of 3.3 Å. 
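
The RMSD used in this check can be computed as in the sketch below. It assumes that the original and the recomputed coordinates are given as two arrays of shape (N, 3) with atoms in the same order, and that no further superposition is applied before the comparison; the function name `rmsd` is chosen for illustration only.

```python
import numpy as np


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally ordered (N, 3) arrays."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("coordinate arrays must have the same shape")
    # Mean of squared per-atom distances, then the square root.
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))
```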

In cases where the crystal contains the correct oligomer 
form, but this is not given in the BIOMOLECULE record, 
we supply the correct crystallographic symmetry 
transformation. Altogether, the biomatrices of 34 entries 
have been corrected. The largest tilt angle difference 
between the corrected and the uncorrected original forms was 
found in the case of 2w0f, a potassium-channel KcsA-Fab 
complex with tetraoctylammonium. In the PDB file, it appears 
as a monomer (after applying the given biomatrix 
transformation), but its active form is a tetramer. The 
angle deviation was 23° and the region borders moved by up 
to four residues. We have found similar angle deviations in 
the OPM database as well. The largest tilt angle deviation 
in the OPM database, 19°, can be found between 1py6 and 
1m0l: 1py6 is a monomeric protein in the PDB, while 1m0l is 
a homo-trimer of the same bacteriorhodopsin. 

Figure 2. IFH (coloured in green) in 1e7p, a quinol-fumarate reductase 
from Wolinella succinogenes (28). 


Membrane re-entrant loops 

Membrane re-entrant loops, with both ends facing the 
same side of the membrane, were first detected in the late 
1990s (21) in the cardiac Na+/Ca2+ exchanger. 
Later it was shown that several other channel-like 
transmembrane proteins contain this type of structural 
element, e.g. aquaporins (22), potassium channels (23), 
chloride channels (24), etc. (Figure 1). We have developed a 
new algorithm as an extension of the TMDET to detect these 
structural elements using only the 3D atomic coordinates 
of the given transmembrane protein and the transformation 
matrices produced by the TMDET algorithm. It 
searches for sequence segments that have both ends on the 
same side of the membrane and penetrate the membrane 
to a depth of at least 6 Å (measured from the 
membrane-water interface). This algorithm can detect any 
type of re-entrant loop (e.g. helix-loop-coil, 
coil-loop-helix, coil-loop-coil), but the database currently 
does not store this subclassification. Currently, there are 
258 proteins in the PDBTM database that contain one 
or more re-entrant loops. 
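
The search criterion can be sketched as follows. This is a simplified illustration rather than the TMDET extension itself: it assumes that the structure has already been transformed so that the membrane normal is the z-axis and the membrane centre is at z = 0, that one z-coordinate per residue (for example of the Cα atom) is available, and that the membrane half thickness is known. The function name and its parameters are hypothetical.

```python
def find_reentrant_loops(z, half_thickness, min_depth=6.0):
    """Return (start, end, depth) for segments that enter and leave the
    membrane on the same side and reach at least `min_depth` below the
    membrane-water interface. Indices are 0-based and inclusive."""
    loops = []
    n = len(z)
    i = 0
    while i < n:
        if abs(z[i]) < half_thickness:
            start = i
            # Extend over the whole run of residues inside the membrane.
            while i < n and abs(z[i]) < half_thickness:
                i += 1
            end = i - 1
            # Both flanking residues must exist and lie outside the membrane.
            if start > 0 and i < n:
                side_in = 1 if z[start - 1] > 0 else -1
                side_out = 1 if z[i] > 0 else -1
                if side_in == side_out:
                    # Depth measured from the interface on the entry side.
                    depth = max(half_thickness - side_in * zi
                                for zi in z[start:end + 1])
                    if depth >= min_depth:
                        loops.append((start, end, depth))
        else:
            i += 1
    return loops
```

A segment whose flanking residues lie on opposite sides of the membrane is a membrane-spanning segment and is therefore skipped by this function.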

Interfacial helices 

Another newly implemented structural class is IFHs, which 
are α-helices lying in the membrane-water interface 
parallel to the membrane plane (Figure 2). They have 
various structural roles; for example, they are responsible 
for the regulation of channel gating in both the KirBac1.1 
inward rectifying potassium channel (25) and the MscS 
mechanosensitive channel (26), while in photosystem I, 
IFHs appear to shield cofactors from the aqueous 
phase (27). 

A further extension of the TMDET algorithm contains 
a subroutine that identifies these regions. First, we 
collect α-helical regions that are not in the membrane and 
are longer than four residues, and calculate their tilt angle 
relative to the membrane plane and their distance from 
the membrane-water boundary. The algorithm uses two 
threshold parameters: the distance from the 
membrane-water boundary (<9 Å) and the tilt angle (<30°). 
As a result of this extension, we have identified IFHs in 
851 proteins. 
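
The two thresholds can be applied as in the following sketch. It is an approximation under stated assumptions, not the TMDET subroutine: the membrane normal is taken to be the z-axis with the membrane centre at z = 0, the helix axis is estimated as the first principal axis of the Cα coordinates, and the distance from the membrane-water boundary is measured from the helix centroid. The function name and parameters are hypothetical; only the numerical thresholds come from the text.

```python
import numpy as np


def is_interfacial_helix(ca_coords, half_thickness,
                         max_distance=9.0, max_tilt_deg=30.0, min_length=5):
    """Check a helix (array of C-alpha coordinates, shape (N, 3)) against
    the distance (<9 A) and tilt (<30 degrees) thresholds."""
    ca = np.asarray(ca_coords, dtype=float)
    if len(ca) < min_length:          # "longer than four residues"
        return False
    centred = ca - ca.mean(axis=0)
    # First right singular vector = direction of largest extent (helix axis).
    _, _, vt = np.linalg.svd(centred)
    axis = vt[0]
    # Angle between the axis and the membrane plane (the xy-plane).
    tilt = np.degrees(np.arcsin(min(1.0, abs(axis[2]))))
    # Distance of the helix centroid from the nearer membrane boundary.
    distance = abs(ca[:, 2].mean()) - half_thickness
    return bool(0.0 <= distance < max_distance and tilt < max_tilt_deg)
```

Estimating the axis from a principal component is less reliable for very short helices, so a production implementation would probably need a more careful axis definition.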



THE NEW USER INTERFACE OF THE PDBTM 

The homepage of the upgraded version of the PDBTM 
database uses the Wt C++ Web Toolkit 
(http://www.webtoolkit.eu/wt) programming library and the 
OpenAstexViewer (29) to visualize transmembrane 
protein structures, with the different region types 
highlighted in different colours to make the structure 
more informative. We have recently created a complex 
web application for investigating protein 3D structures 
and residue-residue interactions (30), in which both Wt 
and the OpenAstexViewer were used successfully. 

The PDBTM entry viewer 

The layout of the PDBTM molecule viewer can be seen in 
Figure 3. The navigation bar (Figure 3A) contains an 
up-to-date list of the IDs of the current transmembrane 
protein structures in the PDBTM database. The arrows serve 
for navigation in this list. The previous structure viewer 
has been replaced with the OpenAstexViewer (29). The 
colouring of the 3D structure (Figure 3B) and of the sequence 
(Figure 3C) is identical in order to help users to find 
sequence segments more easily in the 3D structure. 
These two widgets are connected through signals, so by 
clicking on any sequence region (except the grey-coloured 
ones, which represent sequence without solved structure), 
the representation of the corresponding residues in the 
structure viewer turns from cartoon to sphere. 

Figure 3. The PDBTM entry viewer. (A) The navigation bar, which is always visible for the sake of comfortable and instant navigation. Using the 
arrows one can navigate to the first entry, step back, step forward or jump to the end. (B) The structure viewer (29), using the same colours as in the 
sequence box. (C) Sequence box, containing the chain selector and the sequence of the actual protein chain. (D) File download section, where 
the user can download or simply view the original and the transformed PDB files as well as PDBTM XML files. (E) Cross-reference links to the 
RCSB PDB and PDBsum (31) databases. 

Users can download or simply view the original and the 
transformed PDB files as well as the PDBTM XML files 
(Figure 3D), which describe the regions of the structure, 
chain sequences and all the necessary information to build 
up the transformed PDB structure from the original one. 

Advanced search system 

The web server allows users to perform various types of 
search in the database. Some common, frequently used 
search requests have already been implemented, but 
users can also submit custom queries, either in a form 
field or by using the address line of the browser. The 
latter feature enables users to refer to their query 
results as a constantly updated list by bookmarking the 
given query. The search results can be browsed or 
downloaded as a whole in various file formats. For a more 
detailed description, see the manual of the PDBTM 
(http://pdbtm.enzim.hu/?_=/help/manual). 



CONCLUSION 

The PDBTM database is a comprehensive and 
continuously updated transmembrane protein database. 
As of today, it contains >1700 entries whose regions are 
classified into structural elements such as transmembrane 
helices, transmembrane beta segments, membrane 
re-entrant loops or IFHs. The flexible search method 
makes data mining easier for bioinformaticians who are 
interested in transmembrane proteins and their structures. 
All kinds of feedback and advice are most welcome, as 
they will help us to improve the database and to satisfy the 
diverse demands of users more fully. 